Demographic modification of gene expression in predicting diffuse large B cell lymphoma (DLBCL) treatment response
MSt Healthcare Data Science
Authentication of practice
I confirm that I have fully read and understood the assignment
brief for this module.
Y
Details
Name: Chung Yan
Surname: Yu
Submission Date
19-10-25
Word Count: whole assignment including codes
Word Count: main body excluding abstract, references and
supplementary materials
University statement of originality
This assignment is the result of my own work and includes nothing
which is the outcome of work done in collaboration except as declared in
the Preface and specified in the text. It is not substantially the same
as any that I have previously submitted for a degree or diploma or other
qualification at the University of Cambridge or any other university or
similar institution, or that is being concurrently submitted, except as
declared in the Preface and specific in the text. I further state that
no substantial part of my Portfolio has already been submitted, nor is
being concurrently submitted for any such degree, diploma or other
qualification at the University of Cambridge or any other university or
similar institution except as declared in the Preface and specified in
the text.
I confirm the statement of originality as above
Y
Questions for reflection
Self-assessment is an important aspect of feedback literacy, which
is, in turn, key to the development of expertise. As you proceed through
the MSt Healthcare data science programme, we hope that you will make
use of the following prompts to assess your own work on assignments.
Specific assignment briefs will likely indicate which of these to
address for which assessments, but, in general, we expect you to respond
to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it
brief. For each question you answer, limit yourself to no more than
three items. And please remember, this is optional and developmental:
these cover sheets are designed to create space for self-assessment and
feedback dialogue, rather than additional assignment workload.
- Which aspects of this assignment are you most uncertain about and/or
would most like to receive feedback on?
- What elements are you left pondering after this assignment that you
would like to discuss further?
- How have you incorporated feedback from peers and tutors into this
assignment?
- How, and to what extent, have you been able to incorporate feedback
on previous course work into this assignment?
- Using the wording in the rubric, how would you describe the quality
of the different aspects of your work?
Declaration of the use of generative AI
Which permitted use of generative AI are you acknowledging?
Semantic search of literature and notes, outline creation, code
debugging, output formatting and generation, informed feedback
Which generative AI tool did you use (name and version)?
Claude Code v2.0.3x (the version likely changed over usage), Claude
Desktop v1.0.3x with PubMed Connector, M365 CoPilot
What did you use the tool for?
Searching for relevant journal publications, sanity check on coding
strategies, collating notes in my Obsidian vault, debugging code by
parsing error messages
How have you used or changed the generative AI’s output
E.g. my collated notes always stay in point form so that I write out
the paragraphs in my own words. Feedback provided by the GenAI models
are weighed and assessed before being acted on. Code generated are
checked, e.g. it tried to run Levene test on a model fitted using
residuals as the response, this was rectified back to the relevant
response variable.
Abstract
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin
lymphoma, with prognosis influenced by demographic features (age,
gender), integrated scoring systems such as the International Prognostic
Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged
with strong DLBCL associations, and further studies have identified over
150 genetic drivers. Yet whether these genetic markers’ effects differ
across demographic subgroups remains understudied. This study analysed
the gene expression data of 21 genes from 775 DLBCL patients to
investigate whether age and gender modify gene expression-response
associations. Hierarchical logistic regression models tested
demographic-gene interactions across 21 genes and 2 demographic factors
(age and gender, totalling 42). Only CCND2 showed significant
interaction effects (Likelihood ratio test adjusted p = 0.006) with age
on predicting complete response. Older age groups showed significant
increase (61-75 odds ratio p = 0.014; >75 odds ratio p = 0.001) in
complete response probability with higher CCND2 expression, while
younger patients (≤40, 41-60) showed the opposite insignificant trend.
Gender revealed no main or interaction effects across all genes. MYC
showed significant prognostic effects (likelihood ratio test adjusted p
= 0.010) with response completeness that were consistent across all
demographic groups, with no interaction effects. With only 1 of 42
tested demographic-gene combinations demonstrating interaction effects,
the demographic modifying effects remain a rare phenomenon amongst
genetic markers examined. This suggests demographic stratification may
generally be unnecessary for genetic markers. The age-dependent effects
of CCND2 could however provide a starting point to reconcile conflicting
prognostic associations of CCND2 expression levels in existing
literature, although further validation in other independent cohorts and
mechanistic confirmation are required.
Introduction
Diffuse large B cell lymphoma (DLBCL) is the most common type of
aggressive non-Hodgkin’s lymphoma (NHL)1,2,
with a wide range of factors contributing to prognosis. For a comparable
and comprehensive pre-treatment prognosis model, the international
prognostic index (IPI) emerged as a standardised model for DLBCL and
other aggressive NHLs3, combining 5 factors: age, Ann
Arbor stage classification, ECOG Performance Status, serum lactate
dehydrogenase (LDH) level and the number of extranodal sites of disease.
By pooling factors together, the IPI outperformed simple disease stage
classification models such as Ann Arbor.4 went on to improve on
the IPI, creating the National Comprehensive Cancer Network
International Prognostic Index (NCCN-IPI) as an update to the IPI. A key
innovation of the NCCN-IPI was using cubic splines to model the
continuous factors of age and LDH levels as higher resolution
categorical factors, modifying them from binary factors to quaternary
and ternary factors respectively. As prognosis models improved, so did
treatment. The first generation chemotherapy of cyclophosphamide,
doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of
30-35%5. With the addition of
rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes
curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI),
other demographic factors such as gender are not, despite established
evidence for their prognostic involvement. Males demonstrate higher
hazard ratios7 and gender-associated
pharmacokinetic differences in rituximab lead to worse treatment
responses in males8. These demographic
effects raise questions about whether risk factors perform uniformly
across patient subgroups. Beyond demographic and clinical factors, a
series of genetic alterations have emerged as strong prognostic markers:
BCL29, BCL610
and MYC11. These changes are
observed at both the genotypic and phenotypic levels, as translocation
mutations and protein overexpression respectively12. Furthermore,
patients with varying combinations of these alterations have shown worse
responses to R-CHOP treatments, highlighting the clinical relevance of
genomic profiling. Some genetic markers have shown age-associated
expression patterns, and certain genetic drivers (BCL6, IRF4) lose
prognostic significance when age is incorporated in multivariate
models13. This
indicates that demographic factors may not merely add to genomic risk
additively, but could modify how these genomic drivers influence
clinical outcomes.
With the advent of machine learning methods came attempts to
construct models based on clinical and genomic data.14
performed whole exome and transcriptome sequencing in 1001 DLBCL
patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC
alongside numerous newly characterised candidates. By combining
genotypic alterations, gene expression data and cell-of-origin
classification, they developed a genomic risk model that outperformed
the IPI. However, the potential for demographic factors to modify these
genomic associations remains understudied. Given that age and gender
both demonstrate prognostic significance and influence treatment
response, it remains unknown whether genomic risk markers perform
uniformly across demographic subgroups. If these demographic factors
modify how gene expression levels impact treatment response, such that
the same gene expression levels in different gender or age groups lead
to different prognostic outcomes, it would be crucial that these
differences are identified and applied in future models to prevent
inaccurate prognoses.
This study addresses this gap by investigating whether age and gender
interact and modify the prognostic associations between gene expression
markers and therapy response in DLBCL. The associations between the
demographic factors (age & gender) with complete therapy response
and the gene expression levels of 21 genes were first considered.
Logistic regression models using gene expression levels of 21 genes were
fitted with therapy response, and the interaction of demographic factors
with these gene expression levels were examined. Exploring how
demographic factors interact with known and potential genomic drivers of
DLBCL could reveal age or gender-dependent tumour biology that inform
therapeutic approaches.
The analysis utilises clinical and gene expression data15 from Reddy et al.’s
genomic risk model study14 of 1001 DLBCL patients,
focusing on 775 patients with complete gene expression profiles for 21
genes including the big three oncogenic drivers (MYC, BCL2, BCL6) and
cell-of-origin classifiers. Detailed data breakdown and processing are
documented in the Methods section.
Methods
Data Availability & Access
The datasets were publicly available on the ScienceDirect webpage
of Reddy et. al’s study14, and the corresponding
author was contacted via email to inform them of this secondary
analysis. A public GitHub
repository was created to host the processed datasets, attribution,
license information and supplementary materials to ensure data openness,
transparency and reproducibility. For the original study, the authors
obtained anonymised patient clinical information and tumours, which were
processed in line with a protocol approved by the Institutional Review
Board at Duke University. Thus no further anonymisation is required. A
total number of 2 datasets were downloaded as Excel files directly from
their links on the aforementioned ScienceDirect, Table
S1 and Table
S2 using the download.file function from R’s
utils package16 and read via the
readxl package17. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R19). The
processed .csv files were uploaded to this project’s GitHub
repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the sequenced scripts in supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted20 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
Data Cleaning & Selection
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by NCCN-IPI4: ≤40, 41 - 60, 61 - 75,
>75. complete_response was added by combining “Partial
response” and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression columns for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available from
the original study’s datasets (S1
& S2),
representing 2 categories. The first category is the “big three”
oncogenic drivers of MYC, BCL2 and BCL6, serving as strong prognostic
markers. Moreover, MYC21 and BCL613 have both
shown age-dependent prognostic effects, suggesting age may modify how
these markers influence clinical outcomes. The second category comprises
19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22,
with BCL6 appearing in both categories. Despite their initial utility
for classification, genes such as CCND2 and LMO223 have shown to be
strong survival predictors alongside BCL2 and BCL6. More importantly,
IRF4 is among the genetic features showing age associations and
contributes to genetic complexity that loses prognostic significance
when age is incorporated13, alongside
BCL6. Both categories therefore contain genes with potential to interact
with demographic factors, making them suitable for testing whether
demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
Characteristic
Total (N=775)
Complete Response
Incomplete Response
Sample size
N = 775
n = 598
n = 126
Gender
Female
338 (43.6%)
258 (43.1%)
54 (42.9%)
Male
437 (56.4%)
340 (56.9%)
72 (57.1%)
Age at diagnosis, years
61 ± 15.4
59.8 ± 15.7
63.1 ± 14.3
Age groups (NCCN-IPI)
≤40 years
75 (10.1%)
67 (11.5%)
8 (6.5%)
41-60 years
249 (33.5%)
202 (34.8%)
41 (33.1%)
61-75 years
285 (38.3%)
218 (37.5%)
47 (37.9%)
>75 years
135 (18.1%)
94 (16.2%)
28 (22.6%)
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
Exploratory Statistics
Baseline gene expression patterns across demographic groups were
examined using t-test (gender) and one-way Welch’s ANOVA25 (age
groups) with false discovery rate (FDR)26 correction
due to the number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, associations between therapy response completeness and
demographic factors were examined with the Chi-squared tests, with
expected frequencies checked to be \(>
5\) (see 05-response-analysis.R).
Interaction Effects
The baseline information from Exploratory Statistics only describes
demographic-associated patterns, but not whether demographic factors
modify the prognostic associations of gene expression. To systematically
isolate and analyse these prognostic associations of gene expression,
demographic factors and the interaction between the two, hierarchical
logistic regression models28 were fitted for all 21 genes and 2
demographic factors (42 combinations). Each combination had four
logistic regression models fitted to identify significant terms that
contribute to therapy response completeness, the models were nested
as:
- Null model: therapy response ~ 1
- Gene-only model: therapy response ~ gene expression
- Main effect model: therapy response ~ gene expression + demographic
factors (age or gender)
- Interaction model: therapy response ~ gene expression + demographic
factors (age or gender) + gene-demographic interactions
All fitted models were analysed with likelihood-ratio test (LRT) to
obtain FDR-corrected p-values. Significant models were then checked for
model fit with goodness-of-fit with Chi-square tests and Akaike
Information Criterion (AIC) scores, along with influential observations
and collinearity29 (see 06-effect-analysis.R.)
Results
Exploratory Associations
Baseline Gene Expression Patterns by Demographics
Gene expression levels showed no significant differences across
gender groups after t-test with FDR correction. Ten genes showed
significantly different expression levels after Welch’s ANOVA with FDR
correction across NCCN-IPI age groups: BCL2 (p = 0.0002), BLNK (p =
0.003), MYBL1 (p = 0.010), SH3BP5 (p = 0.016), ITPKB (p = 0.020), CCND2
(p = 0.020), PTPN1 (p = 0.031), BMF (p = 0.034), FUT8 (p = 0.037), LMO2
(p = 0.038). Post-hoc analysis with Games-Howell revealed that age
groups mostly clustered into two statistically distinct groups for each
gene, save for BLNK with three groups Figure
132.
Figure 1. Gene Expression Patterns Across Age
Groups.
Fig. 1
Box plots show gene expression
distribution by age group for genes with significant Welch’s ANOVA
results (FDR-adjusted p < 0.05).
Compact letter display (CLD)
letters (a, b, c) above each boxplot indicate statistical groupings from
Games-Howell post-hoc tests within each gene; groups with different
letters differ significantly (p < 0.05).
Sample sizes (n) are
shown below each age group.
Outliers are highlighted in darker
borders.
Therapy Response by Demographics
Neither gender (p = 1.000) nor age group (p = 0.173) showed
significant associations with complete response, although incomplete
response proportions did trend upwards as age increased. While neither
demographic factor showed prognostic effect on treatment response, the
key question whether they modify the gene expression-therapy response
association is addressed in the section
below.
Gene Expression & Demographic Interactions in Predicting
Response
Gene-only Effects
As part of fitting hierarchical models for each gene with either age
or gender groups, gene-only models predicting therapy response (i.e. no
demographic modification) were also fitted (Figure
2). One gene out of 21, MYC, displayed significant gene-only effects
(LRT FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square
goodness-of-fit p-value = 0.947), with higher expression predicting
lower complete response probability. With the lowest AIC score among the
8 MYC models, adding demographic or demographic-MYC interaction did not
improve model fit, indicating MYC’s prognostic effect is consistent
across age and gender groups (Figure 3a).
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Demographic Modification Effects
Systematic testing of 21 genes across both age and gender yielded 42
gene-demographic models, of which no gender model showed significant
effect. Neither gene-gender interaction nor gender main effect models
attained significant results for any gene after FDR adjustment. For age,
only the CCND2 gene showed significant (LRT FDR-adjusted p-value =
0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects
amongst the interaction models of all 21 genes, and the rest 20 genes
showed neither main (additive) nor interactive effects with age. Having
the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models
indicate that response completeness is best predicted with gene
expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels
affecting the response completeness of the age groups differently. The
curves for age groups ≤40 and 41-60 and the curves for age groups 61-75
and >75 are going in opposite directions. Complete response
probability decreases with CCND2 expression level for the 2 younger
groups, while complete response probability increases with CCND2
expression level for the 2 older groups. Further scrutinising the odds
ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed
significantly higher complete response odds (61-75 p = 0.014; >75 p =
0.001), and younger age groups (≤40, 41-60) have trends for worse but
statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression
Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying
Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups
Orange:
Significant
Lilac: Insignificant
Confidence interval: 95%
Both
older age groups showed significant modifying effect.
Discussion
This systematic analysis tested whether demographic factors modify
gene expression associations with treatment response across 21 genes and
2 demographic factors (42 total combinations). Only CCND2 showed
significant age interaction effects, while MYC showed significant
gene-only effects. Gender revealed neither main nor interaction effects
for any gene. The rarity (1 out of 42 combinations) suggests demographic
stratification could be generally unneeded for most genetic markers,
with CCND2 representing an exception pending validation in other
cohorts.
CCND2’s age modifying effects significantly associate higher
expression with better outcomes in older patients (61-75, >75) but
younger patients have an opposite yet statistically insignificant trend.
This represents a novel finding with no precedent in literature. While
it is established that CCND2 is invovled in the cell cycle and is also a
target for the MYC gene34, no prior studies
have demonstrated age-dependent prognostic association, and the
underlying biological mechanism remains unclear and requires
investigation. This finding also provides a possible opening to
reconcile conflicting CCND2 literature, where studies have reported both
positive35 and negative37
prognoses associations. These differences could be observing genuinely
different effects rather than contradicting results. A reanalysis of
these datasets while factoring age effects could be good first step to
validate this premise.
MYC’s consistent effects across all demographic groups validates its
established oncogene status38 and its role
as a universal prognostic marker, contrasting CCND2’s age-dependence.
The other two of the “big three” oncogenes BCL2 and BCL6 only showed
marginal gene-only effects that lost significance after FDR
correction.
The lack of gender effects aligns with the findings of NCCN-IPI’s
authors, where they found no substantial enhancement to their model when
including gender4, and further validates
gender’s exclusion from the index.
While the adequate sample size \((n=775)\) allowed for systematic testing of
42 combinations, several limitations warrant consideration. As a study
on demographic effects, ethnicity information was absent, despite
probable international patient involvement. Gene expression data was
limited to 21 genes pre-selected for publication, missing other
potential genes and interactions. It would also be of great boon to the
study to be able to benchmark the analysis here with Reddy’s genomic
risk model. Unfortunately, the (hyper-)parameters required to rebuild
such model along with the interactive webtool for this model are no
longer accessible. This highlights the need for reproducibility in
cancer research39 and
bioinformatics40 in general,
and make me appreciate the emphasis of robust replicability our
lecturers had imparted on us during the first module.
Future opportunites include validating CCND2’s age interaction
effects in other cohorts, and reanalysing studies with conflicting CCND2
prognosis associations with age stratification. After validation in
different cohorts, perhaps mechanistic studies could investigate the
biology behind this age interaction effect, alongside broader screening
for demographic interactions across potential genomic markers.
Conclusion
This analysis of 775 DLBCL patients discovered 10 genes BCL2, BLNK,
MYBL1, SH3BP5, ITPKB, CCND2, PTPN1, BMF, FUT8 and LMO2, showed
age-dependent gene expression levels. One gene CCND2 showed modifying
effects with age on predicting therapy outcomes, and the MYC gene showed
gene only effects with age. The discovery of CCND2’s modifying effect
could partially resolve the conflicting accounts of how CCND2 activity
associate with prognosis. This is an exciting find that still needs work
to replicate and validate.
References
1. The
Non-Hodgkin’s Lymphoma Classification Project. A clinical evaluation
of the international lymphoma study group classification of
non-hodgkin’s lymphoma. Blood 89,
3909–3918 (1997).
2. Wang, S. S. Epidemiology
and etiology of diffuse large b-cell lymphoma. Seminars in
Hematology 60, 255–266 (2023).
3. Shipp, M. A. et al. A predictive model
for aggressive non-hodgkin’s lymphoma. New England Journal of
Medicine 329, 987–994 (1993).
4. Zhou, Z. et al. An enhanced
international prognostic index (NCCN-IPI) for patients with diffuse
large b-cell lymphoma treated in the rituximab era. Blood
123, 837–842 (2014).
5. Fisher, R. I. et al. Comparison of a
standard regimen (CHOP) with three intensive chemotherapy regimens for
advanced non-hodgkin’s lymphoma. New England Journal of
Medicine 328, 1002–1006 (1993).
6. Johnson, N. A. et al. Concurrent expression of
MYC and BCL2 in diffuse large b-cell lymphoma treated with rituximab
plus cyclophosphamide, doxorubicin, vincristine, and prednisone.
Journal of Clinical Oncology 30, 3452–3459
(2012).
7. Carella, A. M. et al. Prognostic role of
gender in diffuse large b-cell lymphoma treated with rituximab
containing regimens: A fondazione italiana linfomi/grupo de estudos em
moléstias onco-hematológicas retrospective study. Leukemia &
Lymphoma 54, 53–57 (2013).
8. Habermann, T. M. Is rituximab one for
all ages and each sex? Blood 123, 602–603
(2014).
9. Gascoyne, R. D. et al. Prognostic significance
of bcl-2 protein expression and bcl-2 gene rearrangement in diffuse
aggressive non-hodgkin’s lymphoma. Blood
90, 244–251 (1997).
10. Lo
Coco, F. et al. Rearrangements of
the BCL6 gene in diffuse large cell non-hodgkin’s lymphoma.
Blood 83, 1757–1759 (1994).
11. Chenevix-Trench, G., Behm, F. G. & Westin,
E. H. Somatic
rearrangement of the c-myc oncogene in primary human diffuse large-cell
lymphoma. International Journal of Cancer
38, 513–516 (1986).
12. Petrich, A. M., Nabhan, C. & Smith, S. M.
MYC-associated and
double-hit lymphomas: A review of pathobiology, prognosis, and
therapeutic approaches. Cancer 120,
3884–3895 (2014).
13. Klapper, W. et al. Patient age at
diagnosis is associated with the molecular characteristics of diffuse
large b-cell lymphoma. Blood 119,
1882–1887 (2012).
14. Reddy, A. et al. Genetic and functional
drivers of diffuse large b cell lymphoma. Cell
171, 481–494.e15 (2017).
15. Reddy, A. et al. Supplementary data
for "genetic and functional drivers of diffuse large b cell lymphoma":
Table S1 (clinical information and genetic alteration data for 1,001
DLBCL samples) and table S2 (ABC/GCB classification using gene
expression data). (2017).
16. R
Core Team. R: A Language and
Environment for Statistical Computing. (R Foundation for
Statistical Computing, Vienna, Austria, 2025).
17. Wickham, H. & Bryan, J. Readxl: Read
Excel Files. (2025). doi:10.32614/CRAN.package.readxl.
18. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source
Software 4, 1686 (2019).
19. Müller, K. Here: A Simpler Way to Find Your
Files. (2025). doi:10.32614/CRAN.package.here.
20. Xie, Y. Knitr: A General-Purpose Package for
Dynamic Report Generation in r. (2025).
21. Kurz, K. S. et al. Age- and gender-specific
molecular characteristics of diffuse large b-cell lymphoma: Results from
clinical trials of the DSHNHL/GLA. HemaSphere
9, e70093 (2025).
22. Wright, G. et al. A gene expression-based
method to diagnose clinically distinct subgroups of diffuse large b cell
lymphoma. Proceedings of the National Academy of Sciences
100, 9991–9996 (2003).
23. Lossos, I. S. et al. Prediction of survival in
diffuse large-b-cell lymphoma based on the expression of six genes.
New England Journal of Medicine 350, 1828–1837
(2004).
24. Zhu, H. kableExtra: Construct Complex Table
with ’Kable’ and Pipe Syntax. (2024). doi:10.32614/CRAN.package.kableExtra.
25. Kassambara, A. Rstatix: Pipe-Friendly
Framework for Basic Statistical Tests. (2025). doi:10.32614/CRAN.package.rstatix.
26. Reiner, A., Yekutieli, D. & Benjamini, Y.
Identifying
differentially expressed genes using false discovery rate controlling
procedures. Bioinformatics 19, 368–375
(2003).
27. Rongen, M. van et al. Core statisics.
(2025).
28. Robinson, D., Hayes, A. & Couch, S.
Broom: Convert Statistical Objects into Tidy Tibbles. (2025).
doi:10.32614/CRAN.package.broom.
29. Hodgson, V., Castle, M., Nicholls, R. &
Rongen, M. van. Generalised linear
models. (2025).
30. Sievert, C. Interactive Web-Based Data Visualization
with r, Plotly, and Shiny. (Chapman; Hall/CRC, 2020).
31. Graves, S., Piepho, H.-P. & Sundar
Dorai-Raj, L. S. with help from. multcompView: Visualizations of
Paired Comparisons. (2024). doi:10.32614/CRAN.package.multcompView.
32. Neuwirth, E. RColorBrewer: ColorBrewer
Palettes. (2022). doi:10.32614/CRAN.package.RColorBrewer.
33. Lenth, R. V. & Piaskowski, J. Emmeans:
Estimated Marginal Means, Aka Least-Squares Means. (2025). doi:10.32614/CRAN.package.emmeans.
34. Bouchard, C. et al. Regulation of cyclin D2 gene
expression by the myc/max/mad network: Myc-dependent TRRAP recruitment
and histone acetylation at the cyclin D2 promoter. Genes &
Development 15, 2042–2047 (2001).
35. Wang, D., Zhang, Y. & Che, Y.-Q. CCND2 mRNA expression is
correlated with r-CHOP treatment efficacy and prognosis in patients with
ABC-DLBCL. Frontiers in Oncology Volume 10 -
2020, (2020).
36. Hans, C. P. et al. Confirmation of the
molecular classification of diffuse large b-cell lymphoma by
immunohistochemistry using a tissue microarray. Blood
103, 275–282 (2004).
37. Yi,
L. et al. Palbociclib enhances
the effect of doxorubicin-induced apoptosis in activated b-cell–like
diffuse large b-cell lymphoma cells. Anti-Cancer Drugs
34, 257–268 (2023).
38. Dhanasekaran, R. et al. The MYC oncogene — the
grand orchestrator of cancer growth and immune evasion. Nature
Reviews Clinical Oncology 19, 23–36 (2022).
39. Begley, C. G. & Ellis, L. M. Raise standards for preclinical
cancer research. Nature 483, 531–533
(2012).
40. Ziemann, M., Poulain, P. & Bora, A. The five pillars of
computational reproducibility: Bioinformatics and beyond.
Briefings in Bioinformatics 24, bbad375
(2023).
Authentication of practice
| I confirm that I have fully read and understood the assignment brief for this module. | Y |
Details
| Name: Chung Yan | Surname: Yu |
| Submission Date | 19-10-25 |
| Word Count: whole assignment including codes | |
| Word Count: main body excluding abstract, references and supplementary materials |
University statement of originality
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
| I confirm the statement of originality as above | Y |
Questions for reflection
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
- Which aspects of this assignment are you most uncertain about and/or would most like to receive feedback on?
- What elements are you left pondering after this assignment that you would like to discuss further?
- How have you incorporated feedback from peers and tutors into this assignment?
- How, and to what extent, have you been able to incorporate feedback on previous course work into this assignment?
- Using the wording in the rubric, how would you describe the quality of the different aspects of your work?
Declaration of the use of generative AI
| Which permitted use of generative AI are you acknowledging? | Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback |
| Which generative AI tool did you use (name and version)? | Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot |
| What did you use the tool for? | Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages |
| How have you used or changed the generative AI’s output | E.g. my collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on. Code generated are checked, e.g. it tried to run Levene test on a model fitted using residuals as the response, this was rectified back to the relevant response variable. |
Abstract
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by demographic features (age, gender), integrated scoring systems such as the International Prognostic Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged with strong DLBCL associations, and further studies have identified over 150 genetic drivers. Yet whether these genetic markers’ effects differ across demographic subgroups remains understudied. This study analysed the gene expression data of 21 genes from 775 DLBCL patients to investigate whether age and gender modify gene expression-response associations. Hierarchical logistic regression models tested demographic-gene interactions across 21 genes and 2 demographic factors (age and gender, totalling 42). Only CCND2 showed significant interaction effects (Likelihood ratio test adjusted p = 0.006) with age on predicting complete response. Older age groups showed significant increase (61-75 odds ratio p = 0.014; >75 odds ratio p = 0.001) in complete response probability with higher CCND2 expression, while younger patients (≤40, 41-60) showed the opposite insignificant trend. Gender revealed no main or interaction effects across all genes. MYC showed significant prognostic effects (likelihood ratio test adjusted p = 0.010) with response completeness that were consistent across all demographic groups, with no interaction effects. With only 1 of 42 tested demographic-gene combinations demonstrating interaction effects, the demographic modifying effects remain a rare phenomenon amongst genetic markers examined. This suggests demographic stratification may generally be unnecessary for genetic markers. The age-dependent effects of CCND2 could however provide a starting point to reconcile conflicting prognostic associations of CCND2 expression levels in existing literature, although further validation in other independent cohorts and mechanistic confirmation are required.
Introduction
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL)1,2, with a wide range of factors contributing to prognosis. For a comparable and comprehensive pre-treatment prognosis model, the international prognostic index (IPI) emerged as a standardised model for DLBCL and other aggressive NHLs3, combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling factors together, the IPI outperformed simple disease stage classification models such as Ann Arbor.4 went on to improve on the IPI, creating the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI) as an update to the IPI. A key innovation of the NCCN-IPI was using cubic splines to model the continuous factors of age and LDH levels as higher resolution categorical factors, modifying them from binary factors to quaternary and ternary factors respectively. As prognosis models improved, so did treatment. The first generation chemotherapy of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of 30-35%5. With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios7 and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males8. These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL29, BCL610 and MYC11. These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively12. Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatments, highlighting the clinical relevance of genomic profiling. Some genetic markers have shown age-associated expression patterns, and certain genetic drivers (BCL6, IRF4) lose prognostic significance when age is incorporated in multivariate models13. This indicates that demographic factors may not merely add to genomic risk additively, but could modify how these genomic drivers influence clinical outcomes.
With the advent of machine learning methods came attempts to construct models based on clinical and genomic data.14 performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.
This study addresses this gap by investigating whether age and gender interact and modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations between the demographic factors (age & gender) with complete therapy response and the gene expression levels of 21 genes were first considered. Logistic regression models using gene expression levels of 21 genes were fitted with therapy response, and the interaction of demographic factors with these gene expression levels were examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age or gender-dependent tumour biology that inform therapeutic approaches.
The analysis utilises clinical and gene expression data15 from Reddy et al.’s genomic risk model study14 of 1001 DLBCL patients, focusing on 775 patients with complete gene expression profiles for 21 genes including the big three oncogenic drivers (MYC, BCL2, BCL6) and cell-of-origin classifiers. Detailed data breakdown and processing are documented in the Methods section.
Methods
Data Availability & Access
The datasets were publicly available on the ScienceDirect webpage
of Reddy et. al’s study14, and the corresponding
author was contacted via email to inform them of this secondary
analysis. A public GitHub
repository was created to host the processed datasets, attribution,
license information and supplementary materials to ensure data openness,
transparency and reproducibility. For the original study, the authors
obtained anonymised patient clinical information and tumours, which were
processed in line with a protocol approved by the Institutional Review
Board at Duke University. Thus no further anonymisation is required. A
total number of 2 datasets were downloaded as Excel files directly from
their links on the aforementioned ScienceDirect, Table
S1 and Table
S2 using the download.file function from R’s
utils package16 and read via the
readxl package17. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R19). The
processed .csv files were uploaded to this project’s GitHub
repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the sequenced scripts in supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted20 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
Data Cleaning & Selection
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by NCCN-IPI4: ≤40, 41 - 60, 61 - 75,
>75. complete_response was added by combining “Partial
response” and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression columns for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available from the original study’s datasets (S1 & S2), representing 2 categories. The first category is the “big three” oncogenic drivers of MYC, BCL2 and BCL6, serving as strong prognostic markers. Moreover, MYC21 and BCL613 have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22, with BCL6 appearing in both categories. Despite their initial utility for classification, genes such as CCND2 and LMO223 have shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4 is among the genetic features showing age associations and contributes to genetic complexity that loses prognostic significance when age is incorporated13, alongside BCL6. Both categories therefore contain genes with potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
| Characteristic | Total (N=775) | Complete Response | Incomplete Response |
|---|---|---|---|
| Sample size | N = 775 | n = 598 | n = 126 |
| Gender | |||
| Female | 338 (43.6%) | 258 (43.1%) | 54 (42.9%) |
| Male | 437 (56.4%) | 340 (56.9%) | 72 (57.1%) |
| Age at diagnosis, years | 61 ± 15.4 | 59.8 ± 15.7 | 63.1 ± 14.3 |
| Age groups (NCCN-IPI) | |||
| ≤40 years | 75 (10.1%) | 67 (11.5%) | 8 (6.5%) |
| 41-60 years | 249 (33.5%) | 202 (34.8%) | 41 (33.1%) |
| 61-75 years | 285 (38.3%) | 218 (37.5%) | 47 (37.9%) |
| >75 years | 135 (18.1%) | 94 (16.2%) | 28 (22.6%) |
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
Exploratory Statistics
Baseline gene expression patterns across demographic groups were
examined using t-test (gender) and one-way Welch’s ANOVA25 (age
groups) with false discovery rate (FDR)26 correction
due to the number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, associations between therapy response completeness and
demographic factors were examined with the Chi-squared tests, with
expected frequencies checked to be \(>
5\) (see 05-response-analysis.R).
Interaction Effects
The baseline information from Exploratory Statistics only describes demographic-associated patterns, but not whether demographic factors modify the prognostic associations of gene expression. To systematically isolate and analyse these prognostic associations of gene expression, demographic factors and the interaction between the two, hierarchical logistic regression models28 were fitted for all 21 genes and 2 demographic factors (42 combinations). Each combination had four logistic regression models fitted to identify significant terms that contribute to therapy response completeness, the models were nested as:
- Null model: therapy response ~ 1
- Gene-only model: therapy response ~ gene expression
- Main effect model: therapy response ~ gene expression + demographic factors (age or gender)
- Interaction model: therapy response ~ gene expression + demographic factors (age or gender) + gene-demographic interactions
All fitted models were analysed with likelihood-ratio test (LRT) to
obtain FDR-corrected p-values. Significant models were then checked for
model fit with goodness-of-fit with Chi-square tests and Akaike
Information Criterion (AIC) scores, along with influential observations
and collinearity29 (see 06-effect-analysis.R.)
Results
Exploratory Associations
Baseline Gene Expression Patterns by Demographics
Gene expression levels showed no significant differences across gender groups after t-test with FDR correction. Ten genes showed significantly different expression levels after Welch’s ANOVA with FDR correction across NCCN-IPI age groups: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.010), SH3BP5 (p = 0.016), ITPKB (p = 0.020), CCND2 (p = 0.020), PTPN1 (p = 0.031), BMF (p = 0.034), FUT8 (p = 0.037), LMO2 (p = 0.038). Post-hoc analysis with Games-Howell revealed that age groups mostly clustered into two statistically distinct groups for each gene, save for BLNK with three groups Figure 132.
Figure 1. Gene Expression Patterns Across Age Groups.
Fig. 1
Box plots show gene expression
distribution by age group for genes with significant Welch’s ANOVA
results (FDR-adjusted p < 0.05).
Compact letter display (CLD)
letters (a, b, c) above each boxplot indicate statistical groupings from
Games-Howell post-hoc tests within each gene; groups with different
letters differ significantly (p < 0.05).
Sample sizes (n) are
shown below each age group.
Outliers are highlighted in darker
borders.
Therapy Response by Demographics
Neither gender (p = 1.000) nor age group (p = 0.173) showed significant associations with complete response, although incomplete response proportions did trend upwards as age increased. While neither demographic factor showed prognostic effect on treatment response, the key question whether they modify the gene expression-therapy response association is addressed in the section below.
Gene Expression & Demographic Interactions in Predicting Response
Gene-only Effects
As part of fitting hierarchical models for each gene with either age or gender groups, gene-only models predicting therapy response (i.e. no demographic modification) were also fitted (Figure 2). One gene out of 21, MYC, displayed significant gene-only effects (LRT FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square goodness-of-fit p-value = 0.947), with higher expression predicting lower complete response probability. With the lowest AIC score among the 8 MYC models, adding demographic or demographic-MYC interaction did not improve model fit, indicating MYC’s prognostic effect is consistent across age and gender groups (Figure 3a).
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Demographic Modification Effects
Systematic testing of 21 genes across both age and gender yielded 42 gene-demographic models, of which no gender model showed significant effect. Neither gene-gender interaction nor gender main effect models attained significant results for any gene after FDR adjustment. For age, only the CCND2 gene showed significant (LRT FDR-adjusted p-value = 0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects amongst the interaction models of all 21 genes, and the rest 20 genes showed neither main (additive) nor interactive effects with age. Having the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models indicate that response completeness is best predicted with gene expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels affecting the response completeness of the age groups differently. The curves for age groups ≤40 and 41-60 and the curves for age groups 61-75 and >75 are going in opposite directions. Complete response probability decreases with CCND2 expression level for the 2 younger groups, while complete response probability increases with CCND2 expression level for the 2 older groups. Further scrutinising the odds ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed significantly higher complete response odds (61-75 p = 0.014; >75 p = 0.001), and younger age groups (≤40, 41-60) have trends for worse but statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups
Orange:
Significant
Lilac: Insignificant
Confidence interval: 95%
Both
older age groups showed significant modifying effect.
Discussion
This systematic analysis tested whether demographic factors modify gene expression associations with treatment response across 21 genes and 2 demographic factors (42 total combinations). Only CCND2 showed significant age interaction effects, while MYC showed significant gene-only effects. Gender revealed neither main nor interaction effects for any gene. The rarity (1 out of 42 combinations) suggests demographic stratification could be generally unneeded for most genetic markers, with CCND2 representing an exception pending validation in other cohorts.
CCND2’s age modifying effects significantly associate higher expression with better outcomes in older patients (61-75, >75) but younger patients have an opposite yet statistically insignificant trend. This represents a novel finding with no precedent in literature. While it is established that CCND2 is invovled in the cell cycle and is also a target for the MYC gene34, no prior studies have demonstrated age-dependent prognostic association, and the underlying biological mechanism remains unclear and requires investigation. This finding also provides a possible opening to reconcile conflicting CCND2 literature, where studies have reported both positive35 and negative37 prognoses associations. These differences could be observing genuinely different effects rather than contradicting results. A reanalysis of these datasets while factoring age effects could be good first step to validate this premise.
MYC’s consistent effects across all demographic groups validates its established oncogene status38 and its role as a universal prognostic marker, contrasting CCND2’s age-dependence. The other two of the “big three” oncogenes BCL2 and BCL6 only showed marginal gene-only effects that lost significance after FDR correction.
The lack of gender effects aligns with the findings of NCCN-IPI’s authors, where they found no substantial enhancement to their model when including gender4, and further validates gender’s exclusion from the index.
While the adequate sample size \((n=775)\) allowed for systematic testing of 42 combinations, several limitations warrant consideration. As a study on demographic effects, ethnicity information was absent, despite probable international patient involvement. Gene expression data was limited to 21 genes pre-selected for publication, missing other potential genes and interactions. It would also be of great boon to the study to be able to benchmark the analysis here with Reddy’s genomic risk model. Unfortunately, the (hyper-)parameters required to rebuild such model along with the interactive webtool for this model are no longer accessible. This highlights the need for reproducibility in cancer research39 and bioinformatics40 in general, and make me appreciate the emphasis of robust replicability our lecturers had imparted on us during the first module.
Future opportunites include validating CCND2’s age interaction effects in other cohorts, and reanalysing studies with conflicting CCND2 prognosis associations with age stratification. After validation in different cohorts, perhaps mechanistic studies could investigate the biology behind this age interaction effect, alongside broader screening for demographic interactions across potential genomic markers.
Conclusion
This analysis of 775 DLBCL patients discovered 10 genes BCL2, BLNK, MYBL1, SH3BP5, ITPKB, CCND2, PTPN1, BMF, FUT8 and LMO2, showed age-dependent gene expression levels. One gene CCND2 showed modifying effects with age on predicting therapy outcomes, and the MYC gene showed gene only effects with age. The discovery of CCND2’s modifying effect could partially resolve the conflicting accounts of how CCND2 activity associate with prognosis. This is an exciting find that still needs work to replicate and validate.